perf: parallelize commit phase and reduce allocations #2
Closed
* chore: add toolchain and formatting config
Pin Rust 1.88 with minimal profile (cargo, rustc, clippy, rustfmt).
Co-authored-by: Cursor <cursoragent@cursor.com>
* chore(ci): switch to actions-rust-lang/setup-rust-toolchain
Respects rust-toolchain.toml automatically. Also normalize clippy
flags to use --all --all-targets consistently.
Co-authored-by: Cursor <cursoragent@cursor.com>
* feat(primitives): add u128/i128 serialization support
Required by the Fp128 field backend.
Co-authored-by: Cursor <cursoragent@cursor.com>
* feat(algebra): add prime fields, extensions, and modules
Introduces the algebra module with:
- Fp32/Fp64/Fp128 prime field backends with branchless constant-time
add/sub/neg and rejection-sampled random
- U256 helper for Fp128 wide multiplication
- Fp2/Fp4 tower extensions with Karatsuba-ready structure
- VectorModule<F, N> fixed-length vector module over any field
- Poly<F, D> fixed-size polynomial container
Co-authored-by: Cursor <cursoragent@cursor.com>
* feat(algebra): add NTT small-prime arithmetic and CRT helpers
Adds the ntt submodule with:
- NttPrime: per-prime Montgomery-like fpmul, Barrett-like fpred,
branchless csubq/caddq/center
- LimbQ/QData: radix-2^14 limb arithmetic for big-q coefficients
- logq=32 parameter preset (six NTT-friendly primes, CRT constants)
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(algebra): add comprehensive algebra test suite
24 tests covering:
- Field arithmetic, identities, and distributivity (Fp32/Fp64/Fp128)
- Zero inversion returns None
- Serialization round-trips (all field types, extensions, VectorModule)
- Fp2 conjugate, norm, and distributivity
- U256 wide multiply and bit access
- LimbQ round-trip and add/sub inverse
- QData consistency with preset constants
- NTT normalize range and fpmul commutativity
- Poly add/sub/neg
Co-authored-by: Cursor <cursoragent@cursor.com>
* docs: add and update progress tracking document
Records Phase 0 status: all field types, extensions, NTT scaffolding,
constant-time arithmetic, and 24-test suite. Reflects the
fields/ntt/module/poly directory layout.
Co-authored-by: Cursor <cursoragent@cursor.com>
* refactor(ntt): Rust-ify NTT/CRT port from C
Overhaul the NTT small-prime arithmetic and CRT modules:
- Add MontCoeff newtype (#[repr(transparent)] i16 wrapper) to enforce
Montgomery-domain vs canonical-domain separation at the type level
- NttPrime methods now take/return MontCoeff instead of bare i16:
fpmul→mul, fpred→reduce, csubq→csubp, caddq→caddp
- Add domain conversion: from_canonical (i16→Mont), to_canonical (Mont→i16)
- Delete free functions (pointwise_mul etc), replaced by methods on NttPrime
- LimbQ: replace add_limbs/sub_limbs/less_than with std Add/Sub/Ord impls
- LimbQ: replace from_u128/to_u128 with From<u128>/TryFrom for u128
- LimbQ: add Display impl, branchless csub_mod
- Rename all LABRADOR* constants to project-native Q32_* names
- Add #[cfg(test)] verification that re-derives pinv/v/mont/montsq from p
- Add MontCoeff round-trip and LimbQ ordering tests (28 total)
Co-authored-by: Cursor <cursoragent@cursor.com>
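The MontCoeff domain-separation idea above can be sketched with toy parameters (q = 3329 and R = 2^16, borrowed from Kyber-style signed Montgomery arithmetic rather than this crate's actual Q32 primes; all constants and method names here are illustrative):

```rust
// Toy parameters: q = 3329, R = 2^16 (NOT the crate's Q32 primes).
const Q: i32 = 3329;
const QINV: i16 = -3327; // q^{-1} mod 2^16 as a signed 16-bit value
const R2: i32 = 1353; // R^2 mod q, used to enter the Montgomery domain

/// Montgomery-domain coefficient: wrapping i16 in a #[repr(transparent)]
/// newtype means canonical values cannot be mixed with Montgomery values
/// by accident -- the compiler enforces the domain boundary.
#[repr(transparent)]
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct MontCoeff(i16);

/// Signed Montgomery reduction: returns a * R^{-1} mod q in (-q, q),
/// valid for |a| < q * 2^15.
fn montgomery_reduce(a: i32) -> i16 {
    let u = (a as i16).wrapping_mul(QINV); // a * q^{-1} mod 2^16
    ((a - (u as i32) * Q) >> 16) as i16
}

impl MontCoeff {
    fn from_canonical(x: i16) -> Self {
        // x * R^2 * R^{-1} = x * R, i.e. x lifted into the Montgomery domain
        MontCoeff(montgomery_reduce(x as i32 * R2))
    }
    fn to_canonical(self) -> i16 {
        let t = montgomery_reduce(self.0 as i32); // result in (-q, q)
        t + ((t >> 15) & Q as i16) // branchless caddq into [0, q)
    }
    fn mul(self, rhs: Self) -> Self {
        // (aR)(bR)R^{-1} = abR: product stays in the Montgomery domain
        MontCoeff(montgomery_reduce(self.0 as i32 * rhs.0 as i32))
    }
}
```

A call site that forgets `from_canonical` simply fails to type-check, which is the point of the newtype.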
* chore: remove section banners, update progress doc
Remove // ---- Section ---- banner comments from prime.rs and crt.rs.
Add non-negotiable rules to HACHI_PROGRESS.md:
- No section-banner comments
- No commit/push without explicit user approval
Co-authored-by: Cursor <cursoragent@cursor.com>
* feat(ring): add CyclotomicRing, CyclotomicNtt, and NTT butterfly
Milestone 1 - CyclotomicRing<F, D> (coefficient form):
- Schoolbook negacyclic convolution Mul (X^D = -1)
- Add/Sub/Neg/AddAssign/SubAssign/MulAssign, scale, zero/one/x
- HachiSerialize/HachiDeserialize
Milestone 2 - NTT butterfly + CyclotomicNtt<K, D>:
- Merged negacyclic Cooley-Tukey forward NTT (twist folded into twiddles)
- Gentleman-Sande inverse NTT with D^{-1} scaling
- Runtime primitive-root finder and twiddle table computation
(TODO: migrate to compile-time const tables)
- CyclotomicNtt with per-prime pointwise Add/Sub/Neg/Mul
- Ring<->Ntt transforms with CRT reconstruction
Co-authored-by: Cursor <cursoragent@cursor.com>
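The schoolbook negacyclic convolution in Milestone 1 reduces to one rule: products whose degree wraps past D come back negated, since X^D = -1. A minimal sketch over a toy modulus (plain i64 arithmetic, not the crate's field types):

```rust
/// Schoolbook multiplication in Z_q[X]/(X^D + 1): the (i + j >= D) branch
/// applies the X^D = -1 wrap with a sign flip.
fn negacyclic_mul(a: &[i64], b: &[i64], q: i64) -> Vec<i64> {
    let d = a.len();
    assert_eq!(d, b.len());
    let mut out = vec![0i64; d];
    for i in 0..d {
        for j in 0..d {
            let v = a[i] * b[j] % q;
            let k = i + j;
            if k < d {
                out[k] = (out[k] + v).rem_euclid(q);
            } else {
                out[k - d] = (out[k - d] - v).rem_euclid(q); // X^D = -1
            }
        }
    }
    out
}
```

This is the O(D^2) baseline that the NTT path in Milestone 2 is cross-checked against.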
* test(algebra): add ring and NTT tests, wrap in mod tests
Add 12 new tests:
- CyclotomicRing: negacyclic X^D=-1, mul identity/zero, commutativity,
distributivity, associativity, additive inverse, serde, degree-64
- NTT: forward/inverse round-trip (single prime + all primes),
NTT mul matches schoolbook cross-check
Wrap all integration tests in a single mod tests block and remove
section-banner comments.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(algebra): harden ring-NTT conversion and field decoding
Constrain ring/NTT conversions to explicit field backends and replace fragile CRT reconstruction with deterministic modular lifting. Enforce canonical deserialization checks in validated field decoding paths to reject malformed encodings.
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(algebra): add CRT round-trip and serialization guard coverage
Add end-to-end ring->NTT->ring CRT round-trip tests plus reduced-ops stability checks. Expand serialization coverage for Fp4/Poly and verify checked deserialization rejects non-canonical field encodings.
Co-authored-by: Cursor <cursoragent@cursor.com>
* chore(bench): add ring_ntt benchmark target and CT tracking docs
Add a dedicated ring/NTT benchmark harness and register it in Cargo metadata. Record current constant-time review status and sync the implementation progress board with new milestones and test coverage.
Co-authored-by: Cursor <cursoragent@cursor.com>
* refactor(field): split core, canonical, and sampling capabilities
Break the monolithic Field trait into FieldCore, CanonicalField, and FieldSampling, and update algebra primitives to depend on explicit capabilities for cleaner semantics and future backend integration.
Co-authored-by: Cursor <cursoragent@cursor.com>
* feat(fields): add pow2-offset pseudo-mersenne registry and checks
Introduce the curated 2^k-offset prime registry and typed field aliases, then add dedicated Miller-Rabin regression tests to enforce probable primality for all enabled profiles.
Co-authored-by: Cursor <cursoragent@cursor.com>
* refactor(ring): introduce crt-ntt backend/domain layering
Rename the ring NTT representation to explicit CRT+NTT semantics and route conversions through backend traits, adding scalar backend and domain aliases for a cleaner representation-vs-execution boundary.
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(algebra): cover backend parity and pow2-offset invariants
Expand algebra tests to validate default-vs-backend CRT+NTT equivalence, sampling bounds, and pow2-offset registry consistency under the new field and ring abstractions.
Co-authored-by: Cursor <cursoragent@cursor.com>
* docs(algebra): update progress notes and add prime analysis references
Refresh progress and constant-time notes to match the new CRT+NTT naming and field scope, and add the NTT prime analysis document plus local NIST standards artifacts used for parameter rationale.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(algebra): harden fp128 reduction and CRT reconstruction arithmetic
Make Fp128 reduction and CRT inner accumulation paths more timing-stable with branchless modular operations, and refresh ring/docs/tests status after the hardening cleanup pass.
Co-authored-by: Cursor <cursoragent@cursor.com>
* feat(protocol): add transcript and commitment scaffold
Introduce Hachi protocol-layer interfaces and placeholder types with Blake2b/Keccak transcript backends plus phase-aligned labels, while making transcript absorption label-directed at call sites.
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(protocol): add transcript and commitment contract coverage
Add deterministic transcript schedule checks (including keccak) and protocol commitment contract tests so transcript ordering and challenge derivation behavior are locked down.
Co-authored-by: Cursor <cursoragent@cursor.com>
* docs(protocol): align transcript spec and progress status
Document the protocol scaffold as in-progress, capture the commitment-focused transcript label vocabulary, and clarify deferred Jolt adapter expectations.
Co-authored-by: Cursor <cursoragent@cursor.com>
* feat(protocol): add ring commitment core and seeded matrix derivation
Implement the ring-native commitment setup/commit core with config validation, utility modules, and seeded domain-separated public matrix derivation, while wiring prover/verifier stub modules for the next open-check phase.
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(protocol): consolidate ring commitment and stub contract coverage
Unify ring commitment core and config validation checks in one test file and add explicit prover/verifier stub contract tests to lock current placeholder behavior before open-check implementation.
Co-authored-by: Cursor <cursoragent@cursor.com>
* docs(progress): update phase 2 status after commitment core landing
Record that ring-native §4.1 commitment setup/commit and protocol wiring are in place, and clarify that open-check prove/verify remains the next unfinished protocol milestone.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(algebra): harden CT inversion path and CRT final projection
Add a constant-time inversion helper for prime fields and replace scalar CRT's final `% q` projection with a division-free fixed-iteration reducer, so secret-bearing arithmetic paths avoid variable-latency behavior.
Co-authored-by: Cursor <cursoragent@cursor.com>
* refactor(algebra): rename inversion helper API without ct suffix
Rename the secret-path inversion helper to `Invertible::inv_or_zero` while preserving constant-time semantics via doc contracts, and update CT tracking docs to match the new API names.
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(algebra): clean inversion test naming and normalize formatting
Rename the inversion helper test to match the new API naming and keep the ring commitment test formatting consistent after linting.
Co-authored-by: Cursor <cursoragent@cursor.com>
* feat(protocol): add sumcheck core module and tests
Introduce core sumcheck building blocks (univariate messages, compression, and transcript-driving prover/verifier driver) and add unit/integration tests. Update progress doc to reflect sumcheck core landing.
Co-authored-by: Cursor <cursoragent@cursor.com>
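The round structure of the sumcheck driver can be sketched in miniature (toy Z_q arithmetic, fixed challenges instead of transcript-derived ones, degree-1 rounds for a multilinear polynomial; names are illustrative, not the crate's API):

```rust
const Q: u64 = 2147483647; // toy modulus, 2^31 - 1

fn add_q(a: u64, b: u64) -> u64 { (a + b) % Q }
fn mul_q(a: u64, b: u64) -> u64 { ((a as u128 * b as u128) % Q as u128) as u64 }

/// One honest prover/verifier run over the evaluation table of a multilinear
/// polynomial. Each round the prover sends g(0), g(1) (sums of the two
/// halves); the verifier checks g(0) + g(1) == claim, updates the claim to
/// g(r), and the prover folds the table by binding the top variable to r.
fn sumcheck_roundtrip(mut evals: Vec<u64>, challenges: &[u64]) -> bool {
    let mut claim = evals.iter().fold(0, |a, &b| add_q(a, b)); // sum over cube
    for &r in challenges {
        let half = evals.len() / 2;
        let g0 = evals[..half].iter().fold(0, |a, &b| add_q(a, b)); // g(0)
        let g1 = evals[half..].iter().fold(0, |a, &b| add_q(a, b)); // g(1)
        if add_q(g0, g1) != claim {
            return false;
        }
        // g is linear for multilinear f, so g(r) = g0 + r*(g1 - g0)
        claim = add_q(g0, mul_q(r, (g1 + Q - g0) % Q));
        // fold: evals[i] <- evals[i] + r*(evals[half+i] - evals[i])
        for i in 0..half {
            let (lo, hi) = (evals[i], evals[half + i]);
            evals[i] = add_q(lo, mul_q(r, (hi + Q - lo) % Q));
        }
        evals.truncate(half);
    }
    evals.len() == 1 && evals[0] == claim
}
```

The real driver additionally compresses the univariate messages and draws each r from the transcript.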
* Add reference PDF papers
* Add local agent instruction files
* Add Hachi and SuperNOVA digest docs
* Add general field, ring, and multilinear utilities
* Add sparse Fiat-Shamir challenge sampling
* Implement Polynomial Evaluation as Quadratic Equation
* Rename stub to prover and verifier
* Refactor code organization
* Replace decompose with balanced decompose
* Transform polynomial over Fq to ring
* Refactor function names
* Impl commitment_scheme API
* Add SolinasFp128 backend for sparse 128-bit primes
Introduce `SolinasFp128` with two-fold Solinas reduction for `p = 2^128 - c` (sparse `c`), plus `U256::sqr_u128`. Export descriptive prime aliases, add BigUint-backed correctness tests, and include a Criterion bench for mul/inv.
Co-authored-by: Cursor <cursoragent@cursor.com>
* Tighten docs and minor clippy cleanups
Add missing rustdoc Errors/Panics sections and apply small simplifications suggested by clippy.
Co-authored-by: Cursor <cursoragent@cursor.com>
* Add reduction steps to iteration prover
* Optimize Solinas mul/add/sub: fused u64-limb schoolbook + csel canonicalize
Rewrite mul_raw as a fused 2×2 schoolbook multiply with two-fold Solinas
reduction using explicit u64 limbs and mac helper, bypassing U256.
Replace mask-based canonicalize with carry-flag-based pattern that compiles
to adds+adcs+csel+csel (4 insns) instead of 10 on AArch64.
Add pure-mul, sqr, and throughput microbenchmarks.
Made-with: Cursor
* Switch SolinasFp128 repr from u128 to [u64; 2] for 8-byte alignment
Storage is now [u64; 2] (lo, hi) which halves alignment from 16 to 8
bytes, improving struct packing. Arithmetic hot paths convert to u128
for LLVM-optimal codegen (adds/adcs pairs), so no perf regression.
Made-with: Cursor
* Fuse overflow correction with canonicalize in fold2_canonicalize
When fold-2 overflows, the wrapped value s < C², so s + C < C(C+1) < P —
meaning s + C is already canonical. This lets us replace the separate
overflow-correction + canonicalize (3 + 4 insns) with a single fused
`if (overflow | carry) { s + C } else { s }` select, saving 2 instructions
on the critical path. Add compile-time assertion enforcing C(C+1) < P.
Made-with: Cursor
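The fused correction above can be illustrated at 64 bits, where the arithmetic is directly checkable with u128: for p = 2^64 - C (here C = 59, so p is the largest 64-bit prime), a fold-2 overflow wraps to s < C^2, and s + C < C(C+1) < p is already canonical; when s >= p, adding C wraps around 2^64, which is exactly s - p. This is a sketch of the same identity the Fp128 code applies at 128 bits, not the crate's implementation:

```rust
const C: u64 = 59;
const P: u64 = C.wrapping_neg(); // 2^64 - C, prime for C = 59

/// Reduce a full 128-bit product modulo p = 2^64 - C with two Solinas folds
/// and the fused overflow + canonicalize select described above.
fn reduce_product(x: u128) -> u64 {
    // fold 1: hi * 2^64 + lo == lo + hi * C (mod p)
    let t = (x as u64 as u128) + ((x >> 64) as u64 as u128) * C as u128;
    // fold 2: the high part is now at most C, so hi2 * C fits in u64
    let (lo2, hi2) = (t as u64, (t >> 64) as u64);
    let (s, overflow) = lo2.overflowing_add(hi2 * C);
    // fused select: overflow => s < C^2, s + C < C(C+1) < p is canonical;
    // s >= P => s + C wraps mod 2^64 to s - P. One add covers both cases.
    if overflow || s >= P { s.wrapping_add(C) } else { s }
}
```

The compile-time assertion mentioned in the commit corresponds to requiring C(C+1) < P, which holds for any C below roughly 2^32.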
* Unify Fp128 with Solinas-optimized arithmetic, delete SolinasFp128
Replace the generic Fp128<const MODULUS: u128> (binary-long-division via
U256) with the Solinas-optimized implementation. Fp128<const P: u128>
now uses [u64; 2] storage, fused schoolbook 2x2 + two-fold Solinas
reduction (~23 cycles/mul on AArch64/x86-64), and compile-time
validation that P = 2^128 - C with C < 2^64.
Delete SolinasFp128, SolinasParams, solinas128.rs, and u256.rs. All
call sites updated; prime type aliases (Prime128M13M4P0 etc.) are now
simple Fp128<...> aliases in fp128.rs. Blanket PseudoMersenneField impl
for all Fp128<P>.
Made-with: Cursor
* Use git deps for ark-bn254/ark-ff instead of local paths
Switch from local path dependencies to the a16z/arkworks-algebra
git repo (branch dev/twist-shout) so collaborators can compile
without needing a local checkout of arkworks-algebra-jolt.
Made-with: Cursor
* Add template for sumchecks
* Optimize Fp128 mul path and expand Rust field benchmarks.
Refine Fp128 multiply/fold carry handling for better generated code and add isolated, passthrough, independent, and long-chain Rust microbenches to separate latency and throughput effects when comparing against BN254.
Made-with: Cursor
* Add 2^a±1 Fp128 reduction specialization and benches.
Detect C = 2^a ± 1 at compile time and route fold multiplications through a specialized shift-based path with generic fallback, plus add benchmark coverage for sparse 128-bit primes using this shape.
Made-with: Cursor
* Add packed Fp128 field backend scaffolding and focused benchmarks.
This introduces AArch64-first packed field abstractions with a scalar fallback and adds dedicated field-only validation/benchmark coverage before any ring or protocol integration.
Made-with: Cursor
* Refactor packed Fp128 backend to true SoA layout and stabilize benchmarking.
This switches packed lane storage to SoA with NEON add/sub kernels and a SoA mul path, and updates packed-field APIs and benches so scalar-vs-packed latency/throughput comparisons are measured consistently.
Made-with: Cursor
* Optimize packed Fp128 mul throughput with array-backed SoA lanes.
This keeps mul in true SoA form while removing repeated vector transmute overhead and inlining the limb-level Solinas lane kernel, improving packed mul throughput and latency against scalar baselines.
Made-with: Cursor
* Add Fp128 widening multiply API and specialized Solinas reduction
Expose mul_wide_u64, mul_wide, mul_wide_u128, solinas_reduce, and
to_limbs for deferred-reduction patterns needed by jolt-hachi.
Hand-optimized reduce paths for 3/4/5 limbs avoid generic loop
overhead. Refactor mul_raw to reuse mul_wide + reduce_4 (zero
overhead). Add 9 unit tests and widening/accumulator benchmarks.
Made-with: Cursor
* Clean up fp128: remove section banners, hoist std::ops imports, rename mul_wide free fn
Rename free function mul_wide → mul64_wide to avoid shadowing
Fp128::mul_wide. Move reduce_4 next to fold2_canonicalize. Replace
fully qualified std::ops::{Add,Sub,Mul,Neg} with use imports.
Made-with: Cursor
* Constrain Fp32/Fp64 to pseudo-Mersenne primes with Solinas reduction
Rework fp32.rs and fp64.rs to require p = 2^k - c (small c), matching
fp128's design. Compile-time constants BITS/C/MASK derived from P with
static assertions. Replace bit-serial reduction with two-fold Solinas
reduction (reduce_product for hot path, loop-based reduce_u64/u128 for
arbitrary inputs). Add widening ops (mul_wide, square, solinas_reduce).
Fix FieldSampling to use direct modular reduction instead of rejection
sampling. Blanket-impl PseudoMersenneField, remove manual impls. Rename
const generic MODULUS -> P at all call sites. Add latency + throughput
benchmarks. Hoist mid-function imports in tests/algebra.rs.
Made-with: Cursor
* Specialize Fp64 sub-word primes to u64-only arithmetic
For BITS < 64 (e.g. 2^40-195), avoid u128 intermediates in
reduce_product, add_raw, and sub_raw. Use mul_c_narrow which splits
C*high into u32x32->u64 widening multiplies (umaddl on AArch64),
preventing LLVM from promoting to u128. Brings 40-bit mul throughput
within 4% of 64-bit (690 vs 716 Melem/s), up from ~20% gap.
Made-with: Cursor
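The u64-only split can be sketched as follows (illustrative, not the crate's mul_c_narrow): with c < 2^32, each partial product is a 32x32->64 widening multiply, the shape that lowers to umaddl on AArch64, and no u128 intermediate appears.

```rust
/// Compute c * x mod 2^64 using only 32x32->64 multiplies. Assumes c < 2^32
/// (the pseudo-Mersenne offset C is small in practice).
fn mul_c_narrow(c: u64, x: u64) -> u64 {
    debug_assert!(c < 1 << 32);
    let lo = (x & 0xFFFF_FFFF) * c; // 32x32 -> 64, cannot overflow
    let hi = (x >> 32) * c;         // 32x32 -> 64, cannot overflow
    lo.wrapping_add(hi << 32)       // recombine modulo 2^64
}
```

Writing the split out explicitly is what stops LLVM from promoting the whole expression to u128.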
* Add 2^30 and 2^31 pseudo-Mersenne primes and expand benchmarks
Add Pow2Offset30Field (2^30-35) and Pow2Offset31Field (2^31-19) prime
definitions and type aliases. Refactor fp32/fp64 latency benchmarks with
chain_bench! macro, add throughput benchmarks for all new primes.
Made-with: Cursor
* Add NEON packed backends for Fp32 (4-wide) and Fp64 (2-wide)
PackedFp32Neon: 4 lanes in uint32x4_t with full NEON Solinas reduction
for mul (vmull_u32 + 2-fold reduce), umin trick for add/sub (BITS<=31),
overflow-aware paths for BITS==32. C_SHIFT_KIND optimization for
C=2^a+/-1.
PackedFp64Neon: 2 lanes in uint64x2_t with NEON add/sub (conditional
P for BITS<=62, carry-aware for BITS>=63), scalar-per-lane mul (no
native 64x64->128 on NEON).
Fp32 packed achieves 2.4-3.5x mul throughput and 3.5-5.0x add/sub
throughput over scalar. Includes HasPacking impls, type aliases,
NoPacking fallbacks, 7 correctness tests, and throughput benchmarks.
Made-with: Cursor
* Optimize packed Fp32/Fp64 Solinas multiply hot paths on NEON
For packed Fp32, remove the shift/add C-special-case in the Solinas fold and
always use vmull_u32 with a hoisted C broadcast, which improves stability and
removes the 24-bit mul regression. For packed Fp64, replace per-lane Fp64
wrapper multiplication with packed-local per-lane 64x64->128 products plus
specialized Solinas reduction (including the sub-word u64 fold path), reducing
mul overhead for both 40-bit and 64-bit packed variants.
Made-with: Cursor
* Tune packed Fp64 mul folding and add reducer/codegen probes
Switch packed Fp64 sub-word fold multiplication to direct `C*x`, which improves packed mul throughput in repeated A/B runs. Add dedicated reducer and codegen probe benches so we can compare 40-bit and 64-bit fold paths with instruction-level visibility.
Made-with: Cursor
* Optimize x86 BMI2 multiply paths for fp64/fp128 fields
Use BMI2 widening multiplies in scalar field hot paths and specialize x86 sub-word fold multiplication to a single 64-bit multiply, improving 40-bit fp64 throughput while keeping 64-bit and 128-bit paths stable.
Made-with: Cursor
* Optimize fp128 wide-limb multiply path for Jolt integration
Raise Hachi MSRV to 1.88, add specialized Fp128 mul_wide_limbs kernels for M={3,4} and OUT={4,5,6}, and add field_arith benches that track mul_wide_limbs-only and roundtrip costs to catch regressions.
Made-with: Cursor
* Specialize Fp128 CanonicalField small-int constructors
Make from_u64 use a direct canonical limb construction (no reduction path), fix from_i64 to use unsigned_abs to avoid i64::MIN overflow, and add a regression test for the min-value case.
Made-with: Cursor
* Impl sumchecks for hachi
* Add optimized one-hot commitment path for regular sparse witnesses
Exploits the structure of one-hot vectors (T chunks of K field elements,
each chunk with exactly one 1) to eliminate all inner ring multiplications.
Gadget decomposition of {0,1} coefficients is trivial (only level-0 digit
is nonzero), and the inner Ajtai t = A*s reduces to summing selected
columns of A with O(D) negacyclic rotations instead of O(D^2) ring muls.
Handles both K >= D and D >= K as long as one divides the other:
- K >= D: each nonzero ring element is a monomial X^j (single rotation)
- D >= K: each ring element is a sum of D/K monomials (multiple rotations)
Total inner cost: N_A * T * D coefficient additions (zero multiplications),
vs N_A * 2^M * delta * D^2 coefficient multiplications in the dense path.
Made-with: Cursor
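The O(D) negacyclic rotation that replaces the ring multiplications can be sketched directly (toy i64 coefficients, not the crate's ring type): multiplying by the monomial X^j is a rotation whose wrapped coefficients flip sign, because X^D = -1.

```rust
/// Multiply a ring element of Z[X]/(X^D + 1) by X^j (0 <= j < D): a rotation
/// with sign flip on wrap-around, costing O(D) additions and no multiplies.
fn rotate_negacyclic(a: &[i64], j: usize) -> Vec<i64> {
    let d = a.len();
    debug_assert!(j < d);
    let mut out = vec![0i64; d];
    for (i, &c) in a.iter().enumerate() {
        let k = i + j;
        if k < d { out[k] += c; } else { out[k - d] -= c; } // X^D = -1
    }
    out
}
```

Summing selected, rotated columns of A is then exactly the "coefficient additions only" inner Ajtai product described above.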
* Apply rustfmt formatting to fp128 and field_arith bench
Made-with: Cursor
* Inject sumchecks into Hachi prover
* Add commitment to w to transcript
* Add AVX2 and AVX-512 packed field backends for Fp32, Fp64, Fp128
Implement vectorized SIMD arithmetic for x86_64:
- AVX2: 8-wide Fp32, 4-wide Fp64, 2-wide Fp128 (scalar delegation)
- AVX-512: 16-wide Fp32, 8-wide Fp64, 4-wide Fp128 (scalar delegation)
Fp32 uses even/odd lane split with 2-fold Solinas reduction.
Fp64 uses vectorized 64×64→128 schoolbook multiply (adapted from
plonky3 Goldilocks) with custom Solinas reduction for pseudo-Mersenne
primes p = 2^k - c.
Also: extract NEON backend into packed_neon.rs, add cfg-gated module
selection (AVX-512 > AVX2 > NEON > NoPacking), enable nightly
stdarch_x86_avx512 feature, add sumcheck-mix benchmark, and fix minor
clippy lints in fp64/fp128.
Made-with: Cursor
* Vectorize Fp128 packed add/sub on AVX-512 (8-wide) and AVX2 (4-wide)
Convert Fp128 packed backends from scalar delegation (AoS) to SoA layout
with vectorized add/sub via __m512i / __m256i. Mul remains scalar per-lane.
Add FIELD_OPS_PERF.md with Zen 5 benchmark results.
Fp128 packed add: +114% (1.08 → 2.31 Gelem/s on Zen 5 AVX-512)
Fp128 packed sub: +137% (1.34 → 3.18 Gelem/s)
Made-with: Cursor
* Add M4 Pro NEON benchmarks, remove mul_add experiment
Populate FIELD_OPS_PERF.md with Apple M4 Pro (NEON) results for all
primes across scalar, packed, and sumcheck MACC workloads. Remove
the experimental mul_add trait method (vectorized add already optimal
after inlining; scalar fused approach was 16% slower).
Made-with: Cursor
* Change sumcheck API
* Separate ring switch logic
* Rename sumchecks to NormSumcheck and RelationSumcheck
* Remove iteration prover
* Eliminate O(D^2) schoolbook ring multiplication from protocol hot paths
At production parameters (D=256/1024), schoolbook CyclotomicRing
multiplication is catastrophically expensive. Every protocol hot path
has exploitable operand structure that avoids the full D^2 cost:
- Add CyclotomicRing::mul_by_sparse for O(omega*D) sparse challenge
multiplication (90-140x speedup in compute_z_hat)
- Change RingOpeningPoint to store Vec<F> scalars; use scale() instead
of ring mul in compute_w_hat (256-1024x speedup)
- Add kron_scalars, kron_row_scale, kron_sparse_scale; refactor
generate_m to use scalar-aware Kronecker products
- Add zero-skip and scalar-detect in compute_r_via_poly_division
- Add sample_sparse_challenges, store Vec<SparseChallenge> in
QuadraticEquation throughout prover and verifier paths
Made-with: Cursor
* lint: section banner removal, naming hoist, cfg(test) for test-only paths
- Remove section banner comments (----, =====) repo-wide in src, tests, benches
- commitment_scheme: hoist RingCommitment, RingOpeningPoint, transcript labels
to top-level use; add #[cfg(test)] use for rederive_alpha_and_m_a body
(Blake2bTranscript, eval_ring_matrix_at, expand_m_a, labels) so that
function uses short names without polluting lib build
- Leave mod tests imports in place (no hoisting of test-module use blocks)
Made-with: Cursor
* Fix CI issues
---------
Co-authored-by: Quang Dao <quang.dao@layerzerolabs.org>
Co-authored-by: Cursor <cursoragent@cursor.com>
…ult)
- New src/parallel.rs with cfg_iter!/cfg_into_iter!/cfg_chunks! macros that
  dispatch to rayon parallel iterators when `parallel` is enabled
- Parallelize protocol hot paths: ring polynomial division, w_evals
  construction, M_alpha evaluation, ring vector evaluation, packed ring poly
  evaluation, coefficients-to-ring reduction, quadratic equation folding, and
  sumcheck round polynomial computation
- All 174 tests pass with and without the parallel feature
Made-with: Cursor
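The dispatch pattern can be sketched like this (macro body is illustrative, not the crate's exact definition; with the `parallel` feature enabled, the first arm would expand to rayon's `par_iter`):

```rust
// With `--features parallel`, expand to rayon's parallel iterator.
#[cfg(feature = "parallel")]
macro_rules! cfg_iter {
    ($e:expr) => {
        $e.par_iter()
    };
}
// Without the feature, fall back to the plain sequential iterator.
#[cfg(not(feature = "parallel"))]
macro_rules! cfg_iter {
    ($e:expr) => {
        $e.iter()
    };
}

/// Call sites are identical in both builds; only the macro expansion changes.
fn sum_squares(v: &[u64]) -> u64 {
    cfg_iter!(v).map(|x| x * x).sum()
}
```

Because the choice is made at macro-expansion time, the sequential build carries no rayon dependency at all.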
- Make HachiCommitmentScheme generic over <const D, Cfg> so different configs
  (and thus num_vars) can be used without code duplication.
- Remove hardcoded DefaultCommitmentConfig::D from ring_switch.rs;
  WCommitmentConfig and commit_w now flow D generically.
- Add benches/hachi_e2e.rs with configs sweeping nv=10,14,18,20.
Made-with: Cursor
Make NTT primitives (NttPrime, NttTwiddles, MontCoeff, CyclotomicCrtNtt) generic over PrimeWidth (i16/i32) instead of hardcoding i16. Replace the monolithic QData struct with separate GarnerData and per-prime NttPrime arrays. Add Q128 parameter set (5 × i32 primes, D ≤ 1024) alongside the existing Q32 set. Simplify ScalarBackend by removing the const-generic limb count from to_ring_with_backend. Made-with: Cursor
Split CanonicalField into FromSmallInt (from_{u,i}{8,16,32,64} for all
fields) and CanonicalField (u128 repr, base fields only). Implement
FromSmallInt, Eq, Debug for Fp2/Fp4. Add ExtField<F> trait with
EXT_DEGREE and from_base_slice.
Optimize extension field arithmetic: Karatsuba multiplication for Fp2
and Fp4 (3 base muls instead of 4), specialized squaring (2 base muls
for Fp2), non-residue IS_NEG_ONE specialization. Add concrete configs
(TwoNr, NegOneNr, UnitNr) and type aliases Ext2<F>, Ext4<F>.
Add transpose-based packed extension fields (PackedFp2, PackedFp4)
for SIMD acceleration, following Plonky3's approach.
Relax sumcheck bounds from E: CanonicalField to E: FromSmallInt (or
E: FieldCore where spurious). Add sample_ext_challenge transcript
helper. Includes tests for extension field sumcheck execution.
Made-with: Cursor
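The Karatsuba count (3 base muls instead of 4) can be seen in a small sketch over an illustrative base field (q = 2^31 - 1, non-residue u^2 = -1, mirroring the IS_NEG_ONE specialization; not the crate's Fp2 code):

```rust
const Q: u64 = (1 << 31) - 1; // illustrative base-field modulus

fn mul_q(x: u64, y: u64) -> u64 {
    ((x as u128 * y as u128) % Q as u128) as u64
}

/// (a0 + a1*u)(b0 + b1*u) with u^2 = -1, using exactly three base muls.
fn fp2_mul_karatsuba(a: (u64, u64), b: (u64, u64)) -> (u64, u64) {
    let v0 = mul_q(a.0, b.0); // mul 1: a0*b0
    let v1 = mul_q(a.1, b.1); // mul 2: a1*b1
    // c0 = a0*b0 + u^2 * a1*b1 = v0 - v1 for u^2 = -1
    let c0 = (v0 + Q - v1) % Q;
    // mul 3 (the Karatsuba trick): c1 = (a0+a1)(b0+b1) - v0 - v1
    let t = mul_q((a.0 + a.1) % Q, (b.0 + b.1) % Q);
    let c1 = (t + 2 * Q - v0 - v1) % Q;
    (c0, c1)
}
```

The schoolbook version would spend a fourth base mul computing a0*b1 and a1*b0 separately.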
Correctness fixes:
- Rewrite negacyclic NTT as twist + cyclic DIF/DIT pair (no bit-reversal
  permutation), correctly diagonalizing X^D+1.
- Center coefficient→CRT mapping and Garner reconstruction to handle
  negacyclic sign wrapping consistently.
- Fix i32 Montgomery csubp/caddp overflow via branchless i64 widening.
- Fix q128 centering overflow in balanced_decompose_pow2 (avoid casting
  q≈2^128 into i128).
- Remove dense-protocol schoolbook fallback; all mat-vec now routes through
  CRT+NTT.
Performance optimizations:
- Precompute per-stage twiddle roots in NttTwiddles (eliminate runtime
  pow_mod per butterfly stage).
- Forward DIF butterfly skips reduce_range before Montgomery mul (safe
  because mul absorbs unreduced input).
- Hoist centered-coefficient computation out of per-prime loop in from_ring.
- Add fused pointwise multiply-accumulate for mat-vec inner loop.
- Add batched mat_vec_mul_crt_ntt_many that precomputes matrix NTT once and
  reuses it across many input vectors.
- Wire commit_ring_blocks to batched A*s path.
Benchmarks (D=64, Q32/K=6):
- Single-prime forward+inverse NTT: 1.14µs → 0.43µs (2.7x)
- CRT round-trip: 10.7µs → 6.3µs (1.7x)
- Commit nv10: ~70% faster, nv20: ~47% faster
Made-with: Cursor
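The twist identity behind the rewrite can be checked at toy size (Z_17, D = 4, psi = 2 with psi^4 = -1 mod 17; the real code replaces the naive cyclic convolution below with a cyclic DIF/DIT NTT pair):

```rust
const Q: i64 = 17;
const D: usize = 4;
const PSI: i64 = 2; // 2D-th root of unity: PSI^D == -1 (mod Q)

fn pow_mod(mut b: i64, mut e: i64) -> i64 {
    let mut r = 1;
    while e > 0 {
        if e & 1 == 1 { r = r * b % Q; }
        b = b * b % Q;
        e >>= 1;
    }
    r
}

/// Plain cyclic convolution mod X^D - 1 (stand-in for a cyclic NTT pair).
fn cyclic_mul(a: &[i64], b: &[i64]) -> Vec<i64> {
    let mut out = vec![0i64; D];
    for i in 0..D {
        for j in 0..D {
            out[(i + j) % D] = (out[(i + j) % D] + a[i] * b[j]) % Q;
        }
    }
    out
}

/// Negacyclic product via twist: scale a_i by PSI^i, convolve cyclically,
/// unscale by PSI^{-i}. This diagonalizes X^D + 1 without bit reversal.
fn negacyclic_mul_via_twist(a: &[i64], b: &[i64]) -> Vec<i64> {
    let psi_inv = pow_mod(PSI, 2 * D as i64 - 1); // PSI^(2D) = 1
    let twist = |v: &[i64], w: i64| -> Vec<i64> {
        (0..D).map(|i| v[i] * pow_mod(w, i as i64) % Q).collect()
    };
    let c = cyclic_mul(&twist(a, PSI), &twist(b, PSI));
    twist(&c, psi_inv)
}
```

For example, X * X^3 should give X^4 = -1, i.e. coefficient 16 mod 17 in degree 0.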
…rsion
The dense mat-vec paths (commit_ring_blocks, commit_onehot B-mul, compute_v)
previously converted coefficient-form matrices to CRT+NTT on every call. Now
the setup eagerly converts A, B, D into an NttMatrixCache and all dense
operations use the pre-converted form. Coefficient-form matrices are retained
for the onehot inner-product path and ring-switch/generate_m.
Made-with: Cursor
…nd extract shared field utilities
- Delete unused HachiRoutines trait and dead algebra/domains/ module
- Remove redundant FieldCore::add/sub/mul and Module::add/neg (covered by
  ops traits)
- Extract is_pow2_u64, log2_pow2_u64, mul64_wide into fields/util.rs to
  deduplicate
Made-with: Cursor
Replace separate blake2b.rs and keccak.rs with a single generic HashTranscript<D: Digest> parameterized by hash function. Blake2bTranscript and KeccakTranscript are now type aliases. Made-with: Cursor
…ename configs, add soundness test
- Fix CompressedUniPoly::degree() off-by-one that could let malformed proofs
  pass
- Split sumcheck/mod.rs: extract types into types.rs, relocate
  multilinear_eval and fold_evals to algebra/poly.rs
- Replace allocating fold_evals with in-place fold_evals_in_place
- Add debug_assert guards to multilinear_eval and fold_evals_in_place
- Introduce CommitWitness struct to replace error-prone 3-tuple returns
- Rename DefaultCommitmentConfig to SmallTestCommitmentConfig, add
  ProductionFp128CommitmentConfig
- Add verify_rejects_wrong_opening negative test for verifier soundness
Made-with: Cursor
Use iter().enumerate() for schoolbook convolution loops and array::from_fn for pointwise NTT operations. Made-with: Cursor
This removes compile-time commitment shape locks, derives beta from runtime layout, and threads layout-aware setup through commit/prove/verify with setup serialization roundtrip coverage. Made-with: Cursor
…erflow fix
- Verifier path never panics; all errors return HachiError
- Bind commitment, opening point, and y_ring in Fiat-Shamir transcript
- Fix i16 csubp/caddp overflow by widening to i32
- multilinear_eval returns Result with dimension checks
- build_w_evals validates w.len() is a multiple of d
- UniPoly::degree uses saturating_sub instead of expect
- Serialize usize as u64 for 32/64-bit portability
- Fix from_i64(i64::MIN) via unsigned_abs
- Remove Transcript::reset from public trait (move to inherent)
- Add batched_sumcheck verifier empty-input guard
Made-with: Cursor
Replace inline crate::protocol::commitment::HachiCommitmentLayout,
hachi_pcs::algebra::backend::{CrtReconstruct, NttPrimeOps}, and
hachi_pcs::algebra::CyclotomicRing with top-level use imports.
Made-with: Cursor
Route small-b rounds through the point-eval interpolation kernel and keep the affine-coefficient kernel for larger b, while adding deterministic baseline-vs-dispatched benchmarks and parity tests to validate correctness across both strategies. Made-with: Cursor
Apply non-functional formatting and import ordering cleanups across commitment, ring-switch, and benchmark/test files to keep the codebase style consistent. Made-with: Cursor
Made-with: Cursor
Change coefficient-to-ring packing from strided to sequential, enabling true
streaming where each trace chunk maps to exactly one inner Ajtai block.
Implement StreamingCommitmentScheme for HachiCommitmentScheme.
- reduce_coeffs_to_ring_elements: sequential packing (chunks_exact(D))
- prove/verify: opening point split flipped to (inner, outer)
- ring_opening_point_from_field: outer split flipped to (M first, R second)
- commit_coeffs: sequential block distribution
- map_onehot_to_sparse_blocks: sequential block distribution
- HachiChunkState + process_chunk / process_chunk_onehot / aggregate_chunks
- Streaming commit tests (matches non-streaming, prove/verify roundtrip)
Made-with: Cursor
Split the monolithic verify_batched_sumcheck into three pieces:
- verify_batched_sumcheck_rounds: replay rounds, return intermediate state
- compute_batched_expected_output_claim: query verifier instances
- check_batched_output_claim: enforce equality
This enables callers (e.g. Greyhound) to intercept the intermediate sumcheck
state before the final oracle check. The original function is preserved as a
convenience wrapper.
Made-with: Cursor
Allows None entries in one-hot index arrays to represent inactive cycles. Adds public commit_onehot free function returning both commitment and hint. Made-with: Cursor
commit_coeffs now accepts ring coefficient vectors shorter than the layout's full size, padding each block internally. prove/verify pad the opening point with zeros so the transcript stays consistent. This avoids materializing huge zero-padded field-element arrays. Made-with: Cursor
Implement HachiSerialize/HachiDeserialize for HachiProof, HachiCommitmentHint, and SumcheckAux so they can be serialized through the ArkBridge adapter in Jolt. Made-with: Cursor
Allow levels * log_basis up to 128 + log_basis. For Fp128 with LOG_BASIS=4, the decomposition needs 33 levels (132 bits total) because 32 levels can't represent the full signed range [-q/2, q/2). The extra level's digit is at most ±1 and the i128 arithmetic remains safe since the quotient shrinks monotonically. Made-with: Cursor
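A balanced power-of-two decomposition of the kind described can be sketched as follows (illustrative, not the crate's balanced_decompose_pow2): digits lie in [-2^(b-1), 2^(b-1)) for log-basis b, and the signed digit range is exactly why one extra level can be needed to cover the full signed input range.

```rust
/// Balanced base-2^log_basis decomposition of a signed value. Each digit is
/// centered into [-2^(log_basis-1), 2^(log_basis-1)); the quotient shrinks
/// monotonically, so i128 arithmetic stays safe.
fn balanced_decompose(mut x: i128, log_basis: u32, levels: usize) -> Vec<i64> {
    let b = 1i128 << log_basis;
    let half = b >> 1;
    let mut digits = Vec::with_capacity(levels);
    for _ in 0..levels {
        let mut d = x % b; // Rust % gives a remainder in (-b, b)
        if d >= half { d -= b; } else if d < -half { d += b; }
        x = (x - d) >> log_basis; // exact: x - d is divisible by 2^log_basis
        digits.push(d as i64);
    }
    assert_eq!(x, 0, "not enough levels for this value");
    digits
}

/// Reconstruction: sum of d_i * 2^(i * log_basis), evaluated Horner-style.
fn recompose(digits: &[i64], log_basis: u32) -> i128 {
    digits.iter().rev().fold(0i128, |acc, &d| (acc << log_basis) + d as i128)
}
```

With log_basis = 4, a final digit of ±1 at an extra level is what covers values past the range of the lower levels, matching the 33-level case described above.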
Same D=16 security parameters as SmallTestCommitmentConfig but derives layout from max_num_vars instead of using a fixed (4,2) shape.

Made-with: Cursor
Short polynomials no longer pad to block_len. commit_coeffs accepts fewer ring elements than num_blocks * block_len, decomposes only the non-zero blocks, and fills remaining entries with zero s/t_hat without allocation or mat-vec multiplication. Also relax debug_assert in mat_vec_mul_precomputed to >= (zip handles the shorter vector correctly).

Made-with: Cursor
prove expects s[i] to have inner_width entries. Use the correct length for zero blocks to match the dense path's decompose_block output size.

Made-with: Cursor
CRT-NTT conversion puts ~28KB on the stack per ring element ([[MontCoeff; D]; K] + [i128; D]). With D=512 and the commit call chain depth, rayon's default thread stack overflows. ensure_large_thread_stack() is called from setup() and is safe to call multiple times (only the first configures the pool).

Made-with: Cursor
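The stack-sizing idea can be illustrated without rayon: spawn a worker with an explicit large stack so a deep call chain with chunky frames cannot overflow. `run_with_large_stack` and `deep_sum` are hypothetical stand-ins; the real fix configures rayon's global pool once.

```rust
use std::thread;

// Hypothetical stand-ins: the real fix configures rayon's global pool once
// via ensure_large_thread_stack(); here std::thread shows the same idea
// without the dependency.
fn run_with_large_stack<T, F>(f: F) -> T
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    thread::Builder::new()
        .stack_size(64 * 1024 * 1024) // 64 MB, far above typical 2-8 MB defaults
        .spawn(f)
        .expect("failed to spawn worker")
        .join()
        .expect("worker panicked")
}

// Each frame holds a chunky buffer, mimicking the ~28 KB of per-element
// CRT-NTT state stacked up along the commit call chain.
fn deep_sum(frames: u32) -> u64 {
    let buf = [1u8; 4096];
    if frames == 0 {
        buf[0] as u64
    } else {
        buf[0] as u64 + deep_sum(frames - 1)
    }
}
```

rayon's `ThreadPoolBuilder` exposes the same `stack_size` knob for its worker threads, which is why configuring the global pool once from `setup()` is enough.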
Exposes MegaPolyBlock enum (Dense/OneHot/Zero) and commit_mixed() which processes heterogeneous blocks in a single commitment. This lets Jolt pack all witness polynomials into one Hachi commitment (one block per polynomial) instead of N independent commitments. Also makes SparseBlockEntry and map_onehot_to_sparse_blocks public so callers can construct one-hot block descriptors.

Made-with: Cursor
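A sketch of what the block descriptor might look like (variant names come from the commit message; the payload shapes and the per-block dispatch are assumptions):

```rust
// Variant names come from the commit message; payload shapes are assumed.
enum MegaPolyBlock {
    /// Full coefficient vector for this block.
    Dense(Vec<i64>),
    /// One hot index per cycle; None marks an inactive cycle.
    OneHot(Vec<Option<usize>>),
    /// All-zero block, contributing nothing to the commitment.
    Zero,
}

// Toy stand-in for per-block dispatch inside commit_mixed: handle each
// block kind without materializing dense zeros for OneHot/Zero blocks.
fn block_weight(block: &MegaPolyBlock) -> i64 {
    match block {
        MegaPolyBlock::Dense(coeffs) => coeffs.iter().sum(),
        MegaPolyBlock::OneHot(indices) => indices.iter().flatten().count() as i64,
        MegaPolyBlock::Zero => 0,
    }
}
```

The point of the enum is that a mixed batch of polynomials can be walked block by block, with each block committed by the cheapest applicable path.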
The basis-decomposed s_i vectors (one per block, each block_len*delta ring elements) were stored in both CommitWitness and HachiCommitmentHint. At production parameters (D=512, block_len=2048, delta=32), each s_i is 512 MB; storing all 64 of them consumed ~32 GB.

Instead, recompute s_i on the fly in compute_w_hat and compute_z_hat from ring_coeffs using decompose_block. Peak memory drops from O(blocks * block_len * delta) to O(block_len * delta) per thread.

Also adds setup_with_layout for caller-specified HachiCommitmentLayout, and makes decompose_block, SparseBlockEntry, map_onehot_to_sparse_blocks public for downstream (Jolt) mega-polynomial integration.

Made-with: Cursor
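The quoted figures can be sanity-checked with a one-line size formula, assuming 16-byte Fp128 coefficients and D-coefficient ring elements (the helper name is illustrative):

```rust
// Sanity-checking the commit message's memory figures, assuming 16-byte
// Fp128 coefficients: each s_i holds block_len * delta ring elements of
// d coefficients each.
fn s_i_bytes(d: usize, block_len: usize, delta: usize, elem_bytes: usize) -> usize {
    block_len * delta * d * elem_bytes
}
```

At D=512, block_len=2048, delta=32 this gives 2048 * 32 = 65,536 ring elements of 8 KB each, i.e. 512 MB per s_i and 32 GB across 64 blocks, matching the message.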
Keep these files locally for reference but remove from the committed tree. They can be selectively re-added later.

Made-with: Cursor
… reduction

Refactor the Hachi proving pipeline to eliminate the 13 GB matrix M and 2.6 GB vector z from memory, reducing peak prover allocation from ~30 GB to ~3.7 GB.

Key changes:

- QuadraticEquation: remove m/z fields; add compute_r_split_eq (split-eq factoring replaces full Kronecker materialization) and compute_m_a_streaming (row-at-a-time M·α evaluation).
- ring_switch: decompose z_pre on the fly in build_w_coeffs; add build_w_evals_compact returning Vec<i8> for round-0 storage (all entries fit in [-8, 7] from balanced_decompose_pow2 with LOG_BASIS=4).
- HachiSumcheckProver: fused norm+relation prover sharing a single w_table. Round 0 uses WTable::Compact(Vec<i8>), folding to WTable::Full(Vec<F>) at half size after the first challenge.
- HachiSumcheckVerifier: fused verifier combining both oracle checks with a batching_coeff sampled from the transcript.
- Remove dead batched mat-vec functions from linear.rs.
- Import hygiene: shorten crate::algebra::ring::X to crate::algebra::X; hoist mid-function use statements to top-level.

Made-with: Cursor
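The Compact-to-Full w-table fold can be sketched as follows (names from the commit message; the even/odd folding rule and the i64 stand-in for the field type F are assumptions):

```rust
// Names from the commit message; the even/odd folding rule and the i64
// stand-in for the field type F are assumptions for illustration.
enum WTable {
    /// Round 0: digits from balanced_decompose_pow2, all in [-8, 7].
    Compact(Vec<i8>),
    /// After the first challenge: field elements at half size.
    Full(Vec<i64>),
}

impl WTable {
    fn as_full(&self) -> &[i64] {
        match self {
            WTable::Full(v) => v,
            WTable::Compact(_) => &[],
        }
    }
}

/// Fold pairs: w'[i] = w[2i] + r * w[2i+1], promoting compact digits
/// to full field elements on the first fold.
fn fold(table: WTable, r: i64) -> WTable {
    let full: Vec<i64> = match table {
        WTable::Compact(digits) => digits.iter().map(|&d| d as i64).collect(),
        WTable::Full(v) => v,
    };
    WTable::Full(full.chunks_exact(2).map(|p| p[0] + r * p[1]).collect())
}
```

Storing round 0 as `Vec<i8>` is what makes the memory win possible: the largest table exists only in its 1-byte-per-entry form, and every later round is already half the size.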
Stack sizing for D>=512 ring elements should be handled by the caller, not baked into the library's setup path.

Made-with: Cursor
- Add block-level parallelism to commit_ring_blocks, commit_coeffs, commit_onehot, and commit_mixed via cfg_iter!/cfg_into_iter!
- Parallelize vector-to-NTT conversion in mat_vec_mul_precomputed_with_params
- Cache CRT+NTT params inside NttMatrixCache, eliminating redundant select_crt_ntt_params calls on every mat-vec multiply
- Add balanced_decompose_pow2_into for in-place decomposition, removing per-element Vec allocations in decompose_block/decompose_rows
- Add inner_ajtai_onehot_t_only that skips the 16MB s-vector allocation when the caller discards it (commit_onehot, commit_mixed)
- Add one-hot and mixed commitment benchmarks to hachi_e2e

Made-with: Cursor
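The parameter-caching pattern behind NttMatrixCache can be sketched with std::sync::OnceLock (the struct name follows the commit message, but the fields, the params type, and the prime values are illustrative stand-ins, not the crate's API):

```rust
use std::sync::OnceLock;

// Illustrative types only: the struct name follows the commit message, but
// the fields, the params type, and the prime values are stand-ins.
struct NttParams {
    primes: Vec<u64>,
}

struct NttMatrixCache {
    params: OnceLock<NttParams>,
}

impl NttMatrixCache {
    fn new() -> Self {
        Self { params: OnceLock::new() }
    }

    // Every mat-vec multiply reuses the same parameters; the (expensive)
    // selection runs at most once per cache.
    fn params(&self) -> &NttParams {
        self.params.get_or_init(|| NttParams {
            primes: vec![7681, 10753, 11777], // illustrative NTT-friendly primes
        })
    }
}
```

`OnceLock::get_or_init` is thread-safe, so the cached parameters can be shared across the parallel block iteration without re-running the selection on each call.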
Summary
- `commit_ring_blocks`, `commit_coeffs`, `commit_onehot`, and `commit_mixed` now process blocks concurrently via `cfg_iter!`/`cfg_into_iter!` (16 blocks in parallel for production params)
- `mat_vec_mul_precomputed_with_params` is now parallel (2048 elements for production inner_width)
- CRT+NTT params are cached inside `NttMatrixCache` and reused across all mat-vec calls, eliminating redundant `select_crt_ntt_params` per invocation
- `balanced_decompose_pow2_into` writes digit planes into caller-provided slices; `decompose_block`/`decompose_rows` pre-allocate a single output buffer; `inner_ajtai_onehot_t_only` skips the 16MB `s`-vector when callers discard it
- One-hot and mixed commitment benchmarks added to `hachi_e2e`

Test plan
- `cargo clippy --all --message-format=short -q -- -D warnings` passes
- `cargo test --lib` passes (92 tests, both with and without the `parallel` feature)
- `cargo fmt` clean

Made with Cursor